LEDA-SM: external memory algorithms and data structures in theory and practice
Author
Abstract
Data to be processed is getting larger and larger. Nowadays, it is necessary to store these huge amounts of data in external memory (mostly hard disks), as their size exceeds the internal main memory of today's computers. Such large amounts of data impose different requirements on algorithms and data structures. Many existing algorithms and data structures were developed for the RAM model [AHU74]. The central features of this model are that memory is infinitely large and that access to different memory cells has unit cost. The RAM model has been, and still is, used to analyze algorithms that run in main memory. External memory, however, behaves differently from main memory: an access to external memory is up to 100,000 times slower than an access to main memory or cache memory. Furthermore, an access to external memory always delivers a block of data. External memory algorithms therefore access two memory layers (main and external memory) with different access times and characteristics, so that assuming unit-cost memory access is questionable. As a result, most RAM algorithms behave very inefficiently when transferred to the external memory setting. This is because they normally do not exhibit locality of reference when accessing their data and therefore cannot profit from blockwise access to external memory. As a consequence, special external memory algorithms and data structures were developed.

In this thesis, we develop external memory algorithms and data structures; development consists of theoretical analysis as well as practical implementation. The first chapter gives an overview. In the second chapter, we explain the functionality of external memory as realized by hard disks. We then explain the functionality of file systems, using Solaris' UFS file system as an example. At the end of chapter two, we introduce the most popular external memory I/O models: Vitter and Shriver's I/O model and the extension of Farach et al.
In Vitter and Shriver's model, a computer consists of a bounded internal memory of size M, while external memory is realized by D independent disk drives. An access to external memory (an I/O for short) transfers up to D·B items (1 ≤ D·B ≤ M/2) to or from internal memory, B items from or to each disk at a time. Farach et al.'s model additionally classifies I/Os into (i) I/Os to random locations (random I/Os) and (ii) I/Os to consecutive locations (bulk I/Os). Bulk I/Os are faster than random I/Os due to the caching and prefetching of modern disk drives. In both models, algorithmic performance is measured by counting (i) the number of executed I/Os, (ii) the number of CPU instructions (using the RAM model), and (iii) the disk space used in external memory. In the third chapter, we introduce our new C++ class library LEDA-SM. The library offers a collection of external memory algorithms and data structures, and fast prototyping makes it possible to quickly develop new external memory applications. LEDA-SM is designed in a modular way. The so-called kernel is responsible for the realization of, and access to, external memory. In LEDA-SM, external memory is realized either by files of the file system or by hard disks themselves. We realize an explicit mapping of Vitter and Shriver's I/O model, i.e. external memory is divided into blocks of size B and each I/O transfers a multiple of the block size. The kernel furthermore offers interfaces for accessing disk blocks and for reading and writing blocks of data. The application part of the library consists of a collection of algorithms and data structures designed to work in external memory. Their implementation uses only C++ classes of LEDA-SM and of the C++ internal memory class library LEDA.
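The blockwise cost model described above can be made concrete with a small sketch. The following is an illustrative simulation, not the real LEDA-SM kernel API: all names (`DiskSim`, `BlockId`, `read_block`, `write_block`) are invented for this example. It captures the one property the model insists on: every transfer moves a whole block of B items and counts as one I/O.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Illustrative sketch (not the real LEDA-SM kernel interface): external
// memory is divided into blocks of B items, and every I/O transfers exactly
// one block. A tiny in-memory "disk" stands in for a file or raw device.
const std::size_t B = 4; // block size in items (unrealistically small)

struct BlockId { std::size_t index; }; // identifies a block on the disk

class DiskSim {
    std::vector<int> store_;   // flat item array standing in for the disk
    std::size_t io_count_ = 0; // every block transfer counts as one I/O
public:
    explicit DiskSim(std::size_t num_blocks) : store_(num_blocks * B, 0) {}

    void write_block(BlockId id, const std::vector<int>& buf) {
        assert(buf.size() == B); // even a partially used block costs a full I/O
        std::copy(buf.begin(), buf.end(), store_.begin() + id.index * B);
        ++io_count_;
    }
    std::vector<int> read_block(BlockId id) {
        ++io_count_;
        return std::vector<int>(store_.begin() + id.index * B,
                                store_.begin() + (id.index + 1) * B);
    }
    std::size_t io_count() const { return io_count_; }
};
```

Writing and re-reading one block costs two I/Os no matter how many of its B items are actually needed, which is exactly why algorithms with locality of reference profit in this model.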
We first describe the main design and implementation concepts of LEDA-SM; we then show the implementation of an external memory data structure and give first performance results. In the last two chapters we derive new external memory algorithms and data structures. The first case study covers external memory priority queues. We theoretically analyze (using the I/O models of Chapter 2) and experimentally compare (using LEDA-SM) state-of-the-art priority queues for internal and external memory. We furthermore propose two new external memory priority queues. Our first variant, the R-heap, is an extension of Ahuja et al.'s redistributive heaps towards secondary memory. It requires nonnegative integer priorities and furthermore assumes that the sequence of deleted minima is nondecreasing. Additionally, all elements currently stored in the heap must have priorities in an interval [a, …, a + C], where a is the priority value of the last deleted minimum (zero otherwise). These requirements are fulfilled, for example, in Dijkstra's shortest path algorithm. This heap is very fast in practice and space optimal in the external memory model, but unfortunately only suboptimal in the multi-disk setting of Vitter and Shriver's D-disk model. R-heaps support the insertion of a new element in O(1/B) amortized I/Os and delete-minimum in O((1/B) · log_{M/(B·log C)}(C)) amortized I/Os. Our second proposal, called the array heap, is based on a sequence of sorted arrays. This variant reaches optimality in the D-disk I/O model and is also disk space optimal. There are no restrictions on the priority type, but the size of the heap (the number of stored elements) is restricted. Array heaps support insertion in (18/B) · log_{c·M/B}(N/B) amortized I/Os and delete-minimum in 7/B amortized I/Os. We analyze two variants with different size restrictions and show that the size restriction does not play an important role in practical applications.
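The core idea behind the array heap can be illustrated with a greatly simplified, purely in-memory sketch. The class below is illustrative only (names and structure are invented for this example): insertions accumulate in a small buffer that, when full, is sorted and appended as a sorted run. In the actual data structure the runs live on disk and are merged level by level in a blockwise fashion, which is what yields the amortized I/O bounds stated above.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Greatly simplified, in-memory sketch of the array-heap idea: insertions
// go into a small buffer; a full buffer is sorted and appended as one
// sorted run. The minimum is then either in the buffer or at the head of
// some run.
class TinyArrayHeap {
    static const std::size_t kBufCap = 4;    // insertion buffer capacity
    std::vector<int> buffer_;                // unsorted recent insertions
    std::vector<std::vector<int>> runs_;     // sorted arrays ("the heap")
public:
    void insert(int x) {
        buffer_.push_back(x);
        if (buffer_.size() == kBufCap) {     // flush buffer as a sorted run
            std::sort(buffer_.begin(), buffer_.end());
            runs_.push_back(buffer_);
            buffer_.clear();
        }
    }
    int delete_min() {
        bool found = false;
        int best = 0, best_run = -1;         // -1: minimum is in the buffer
        std::size_t best_pos = 0;
        for (std::size_t i = 0; i < buffer_.size(); ++i)
            if (!found || buffer_[i] < best) {
                best = buffer_[i]; best_pos = i; best_run = -1; found = true;
            }
        for (std::size_t r = 0; r < runs_.size(); ++r)
            if (!runs_[r].empty() && (!found || runs_[r].front() < best)) {
                best = runs_[r].front(); best_run = (int)r; found = true;
            }
        assert(found); // heap must be nonempty
        if (best_run < 0) buffer_.erase(buffer_.begin() + best_pos);
        else runs_[best_run].erase(runs_[best_run].begin());
        return best;
    }
};
```

Because elements are only ever written sequentially in sorted runs, the disk-resident version touches each block a small number of times per element, in contrast to a pointer-based heap that jumps between blocks.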
In the experimental setting, we compare our LEDA-SM implementations of both new approaches against other well-known internal and external memory priority queues, such as Fibonacci heaps, k-ary heaps, buffer trees and B-trees. Our second case study covers external memory construction algorithms for suffix arrays, a full-text indexing data structure. Suffix arrays were introduced by Manber and Myers and support full-text search. They are an important base data structure, as other important full-text indexes, for example the SB-trees of Ferragina and Grossi, can be built directly from the suffix array. Additionally, suffix arrays are among the most space-efficient full-text indexing data structures in external memory. Unfortunately, the construction algorithm of Manber and Myers is not efficient when transferred to the external memory setting because it does not exploit locality of reference in its data structures. We analyze two well-known construction algorithms, namely Karp-Miller-Rosenberg's repeated doubling algorithm and Baeza-Yates-Snider's construction algorithm. We use the I/O model of Vitter and Shriver as well as the extension of Farach et al. to analyze the number of I/Os (including bulk and random I/Os) and the space used by the different construction algorithms. We furthermore develop three new construction algorithms that all run in the same I/O bound as the repeated doubling algorithm (O((N/B) · log_{M/B}(N/B)) I/Os) but use less space. All construction algorithms are implemented using LEDA-SM and tested on real-world data and artificial inputs. We conclude the case study by addressing two issues: we first show that all construction algorithms can be used to construct word indexes; secondly, we improve the performance of Baeza-Yates-Snider's construction algorithm. In the worst case, this algorithm performs a number of I/Os that is cubic in the size of the text.
We show that this can be reduced to a quadratic number of I/Os.
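The repeated doubling principle behind the Karp-Miller-Rosenberg approach can be sketched in a few lines of in-memory C++. This is a standard textbook formulation, not the thesis's external memory version: ranks of length-k prefixes are combined into ranks of length-2k prefixes until all suffixes are distinguished. The external memory variants replace the comparison sorts below by I/O-efficient sorting, which is where the O((N/B) · log_{M/B}(N/B)) bound comes from.

```cpp
#include <algorithm>
#include <cassert>
#include <string>
#include <vector>

// In-memory sketch of suffix array construction by repeated doubling:
// after round log2(k), suffixes are sorted by their first k characters.
std::vector<int> suffix_array(const std::string& s) {
    int n = (int)s.size();
    std::vector<int> sa(n), rank(n), tmp(n);
    if (n == 0) return sa;
    for (int i = 0; i < n; ++i) { sa[i] = i; rank[i] = (unsigned char)s[i]; }
    for (int k = 1; ; k *= 2) {
        // Compare suffixes a and b by the rank pair of their k-prefixes.
        auto cmp = [&](int a, int b) {
            if (rank[a] != rank[b]) return rank[a] < rank[b];
            int ra = a + k < n ? rank[a + k] : -1; // past-the-end sorts first
            int rb = b + k < n ? rank[b + k] : -1;
            return ra < rb;
        };
        std::sort(sa.begin(), sa.end(), cmp);
        // Re-rank: equal pairs keep the same rank, strictly greater bump it.
        tmp[sa[0]] = 0;
        for (int i = 1; i < n; ++i)
            tmp[sa[i]] = tmp[sa[i - 1]] + (cmp(sa[i - 1], sa[i]) ? 1 : 0);
        rank = tmp;
        if (rank[sa[n - 1]] == n - 1) break; // all suffixes distinguished
    }
    return sa;
}
```

For "banana" the sorted suffix order is a, ana, anana, banana, na, nana, i.e. starting positions 5, 3, 1, 0, 4, 2. Each doubling round sorts N items, and at most log N rounds are needed, which carries over directly to the I/O setting once the sorts are done externally.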
Similar resources
LEDA-SM: Extending LEDA to Secondary Memory
In recent years, many software libraries for in-core computation have been developed. Most internal memory algorithms perform very badly when used in an external memory setting. We introduce LEDA-SM, which extends the LEDA library [22] towards secondary memory computation. LEDA-SM uses I/O-efficient algorithms and data structures that do not suffer from the so-called I/O bottleneck. LEDA is...
Full text
ParLeda: A Library for Parallel Processing in Computational Geometry Applications
ParLeda is a software library that provides the basic primitives needed for parallel implementation of computational geometry applications. It can also be used in implementing a parallel application that uses geometric data structures. The parallel model that we use is based on a new heterogeneous parallel model named HBSP, which is based on BSP and is introduced here. ParLeda uses two main lib...
Full text
Chaotic Genetic Algorithm based on Explicit Memory with a new Strategy for Updating and Retrieval of Memory in Dynamic Environments
Many of the problems considered in optimization and learning assume that solutions exist in a dynamic environment. Hence, algorithms are required that dynamically adapt to the problem's conditions and search under new conditions. Often, utilizing information from the past allows changes to be adapted to quickly. This is the idea underlying the use of memory in this field, which involves key design issue...
Full text
External Memory Algorithms and Data Structures
Data sets in large applications are often too massive to fit completely inside the computer's internal memory. The resulting input/output communication (or I/O) between fast internal memory and slower external memory (such as disks) can be a major performance bottleneck. In this paper, we survey the state of the art in the design and analysis of external memory algorithms and data structures (whi...
Full text
LEDA: A Library of Efficient Data Types and Algorithms
LEDA is a library of efficient data types and algorithms. At present, its strength is graph algorithms and related data structures. The computational geometry part is evolving. The main features of the library are • a clear separation of specification and implementation • parameterized data types • its extendibility • its ease of use. At present, the data types stack, queue, list, set, dictiona...
Full text
Journal:
Volume Issue
Pages -
Publication date: 2001